10. HTML Files in Python
HTML Files in Python 1
Quiz
Using your knowledge of HTML file structure, you will use Beautiful Soup to extract two values from each HTML file: the Audience Score metric and the number of audience ratings. You will also extract the movie title, as in the video above, so that we have something to merge the datasets on later. You will then save all three values in a pandas DataFrame.
The Jupyter Notebook below contains template code that:
- Creates an empty list, df_list, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the most efficient way of building a DataFrame row by row).
- Loops through each movie's Rotten Tomatoes HTML file in the rt_html folder.
- Opens each HTML file and passes it into a file handle called file.
- Creates a DataFrame called df by converting df_list using the pd.DataFrame constructor.
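The steps above can be sketched roughly as follows. This is not the actual notebook template, only a minimal illustration of the same pattern. The function name build_dataframe, the temporary folder standing in for rt_html, and the placeholder columns (filename, n_characters) are all assumptions made for the example; in the real task each dictionary would hold the title, audience score, and number of audience ratings.

```python
import os
import tempfile

import pandas as pd


def build_dataframe(folder):
    """Build a DataFrame with one row per HTML file in `folder` (hypothetical helper)."""
    df_list = []  # list of dictionaries, one per file
    for movie_html in sorted(os.listdir(folder)):
        # Open each HTML file and pass it into a file handle called `file`
        with open(os.path.join(folder, movie_html), encoding="utf-8") as file:
            html = file.read()
            # Placeholder values only: the real task would parse `html`
            # with Beautiful Soup and store the extracted values instead.
            df_list.append({"filename": movie_html,
                            "n_characters": len(html)})
    # Convert the list of dictionaries to a DataFrame in a single step
    return pd.DataFrame(df_list, columns=["filename", "n_characters"])


# Demo: a temporary folder stands in for rt_html (assumption for this sketch)
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "example.html"), "w", encoding="utf-8") as f:
        f.write("<html><title>Example</title></html>")
    df = build_dataframe(folder)
```

Collecting dictionaries in a list and calling the constructor once avoids repeatedly copying a growing DataFrame, which is what happens when rows are added to it one at a time.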
Your task is to extract the title, audience score, and number of audience ratings in each HTML file so each trio can be appended as a dictionary to df_list.
The Beautiful Soup methods required for this task are:
- find()
- find_all()
There is an excellent tutorial on these methods (Searching the tree) in the Beautiful Soup documentation. Please consult that tutorial if you are stuck.
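As a small illustration of the two methods, here is a sketch run against an invented HTML snippet. The markup, class names (audience-score, audience-info), and title suffix are hypothetical and are not the real Rotten Tomatoes page structure, so the selectors will need to be adapted after inspecting the actual files. It is not the quiz solution.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only (not the real page structure)
html = """
<html>
  <head><title>Example Movie (2000) - Hypothetical Site</title></head>
  <body>
    <div class="audience-score"><span>85%</span></div>
    <div class="audience-info">
      <div>Average Rating: 4.1/5</div>
      <div>User Ratings: 1,234</div>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag
title = soup.find("title").get_text()
title = title[: -len(" - Hypothetical Site")]  # drop the site suffix

score_text = soup.find("div", class_="audience-score").find("span").get_text()
audience_score = int(score_text.rstrip("%"))  # "85%" becomes 85

# find_all() returns a list of every matching tag
info = soup.find("div", class_="audience-info")
inner_divs = info.find_all("div")
ratings_text = inner_divs[1].get_text()  # "User Ratings: 1,234"
# Replace the thousands separator with an empty string before converting
num_ratings = int(ratings_text.split(":")[1].strip().replace(",", ""))

row = {"title": title,
       "audience_score": audience_score,
       "number_of_audience_ratings": num_ratings}
```

A dictionary like row is the kind of object that would be appended to df_list for each file.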
Solution
HTML Files In Python 2
Note: At 3:59, "empty character" was said when "empty string" was intended.